{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Using OpenAI's Latest Embedding Model with Milvus\n",
    "\n",
    "Milvus is the world's first open source vector database. With scalability and advanced features such as metadata filtering, Milvus has became a crucial component that empowers semantic search with deep learning technique - embedding models.\n",
    "\n",
    "On January 25, OpenAI released 2 latest embedding models, `text-embedding-3-small` and `text-embedding-3-large`. Both embedding models has better performance over `text-embedding-ada-002`. The `text-embedding-3-small` is a highly efficient model. With 5X cost reduction, it achieves slight higher [MTEB](https://huggingface.co/spaces/mteb/leaderboard) score of 62.3% compared to 61%. `text-embedding-3-large` is OpenAI's best performing model, with 64.6% MTEB score.\n",
    "\n",
    "![](../../images/openai_embedding_scores.png)\n",
    "\n",
    "More impressively, both models support trading-off performance and cost with a technique called \"Matryoshka Representation Learning\". Users can get shorten embeddings for vast reduction of the vector storage cost, without sacrificing the retrieval quality much. For example, reducing the vector dimension from 3072 to 256 only reduces the MTEB score from 64.6% to 62%. However, it achieves 12X cost reduction!\n",
    "\n",
    "![](../../images/openai_embedding_vector_size.png)\n",
    "\n",
    "This tutorial shows how to use OpenAI's newest embedding models with Milvus for semantic similarity search."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Preparations\n",
    "\n",
    "We will demonstrate with `text-embedding-3-small` model and Milvus in Standalone mode. The text for searching comes from the [blog post](https://openai.com/blog/new-embedding-models-and-api-updates) that annoucements the new OpenAI model APIs. For each sentence in the blog, we use `text-embedding-3-small` model to convert the text string into 1536 dimension vector embedding, and store each embedding in Milvus.\n",
    "\n",
    "We then search a query by converting the query text into a vector embedding, and perform vector Approximate Nearest Neighbor search to find the text strings with cloest semantic.\n",
    "\n",
    "To run this demo you'll need to obtain an API key from [OpenAI website](https://openai.com/product). Be sure you have already [started up a Milvus instance](https://milvus.io/docs/install_standalone-docker.md) and installed python client library with `pip install pymilvus openai`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from openai import OpenAI\n",
    "from pymilvus import (\n",
    "    connections,\n",
    "    utility,\n",
    "    FieldSchema,\n",
    "    CollectionSchema,\n",
    "    DataType,\n",
    "    Collection,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up the options for Milvus, specify OpenAI model name as `text-embedding-3-small`, and enter your OpenAI API key upon prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "MILVUS_HOST = \"localhost\"\n",
    "MILVUS_PORT = \"19530\"\n",
    "COLLECTION_NAME = \"openai_doc_collection\"  # Milvus collection name\n",
    "EMBEDDING_MODEL = \"text-embedding-3-small\"  # OpenAI embedding model name, you can change it into `text-embedding-3-large` or `text-embedding-ada-002`\n",
    "\n",
    "client = OpenAI()  # Initialize an Open AI client\n",
    "client.api_key = os.getenv('OPENAI_API_KEY')  # Use your own Open AI API Key or set it in the environment variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Let’s try the OpenAI Embedding service with a text string, print the result vector embedding and get the dimensions of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.00514861848205328, 0.017234396189451218, -0.018690429627895355, -0.01859242655336857, -0.04732108861207962, -0.030296696349978447, 0.027692636474967003, 0.003640083596110344, 0.011249258182942867, 0.006401647347956896, -0.0016966640250757337, 0.0157923623919487, -0.0013186553260311484, -0.007833180017769337, 0.059921376407146454, 0.050261154770851135, -0.027538632974028587, 0.009940228424966335, -0.04040492698550224, 0.05000915005803108] ...\n",
      "\n",
      "Dimensions of `text-embedding-3-small` embedding model is: 1536\n"
     ]
    }
   ],
   "source": [
    "response = client.embeddings.create(\n",
    "    input=\"Your text string goes here\",\n",
    "    model=EMBEDDING_MODEL\n",
    ")\n",
    "res_embedding = response.data[0].embedding\n",
    "print(f'{res_embedding[:20]} ...')\n",
    "dimension = len(res_embedding)\n",
    "print(f'\\nDimensions of `{EMBEDDING_MODEL}` embedding model is: {dimension}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Load vectors to Milvus\n",
    "\n",
    "We set up a collection in Milvus and build index so that we can efficiently search vectors. For more information on how to use Milvus, look [here](https://milvus.io/docs/example_code.md).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Status(code=0, message=)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Connect to Milvus\n",
    "connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)\n",
    "\n",
    "# Remove collection if it already exists\n",
    "if utility.has_collection(COLLECTION_NAME):\n",
    "    utility.drop_collection(COLLECTION_NAME)\n",
    "\n",
    "# Set scheme with 3 fields: id (int), text (string), and embedding (float array).\n",
    "fields = [\n",
    "    FieldSchema(name=\"pk\", dtype=DataType.INT64, is_primary=True, auto_id=False),\n",
    "    FieldSchema(name=\"text\", dtype=DataType.VARCHAR, max_length=65_535),\n",
    "    FieldSchema(name=\"embeddings\", dtype=DataType.FLOAT_VECTOR, dim=dimension)\n",
    "]\n",
    "schema = CollectionSchema(fields, \"Here is description of this collection.\")\n",
    "# Create a collection with above schema.\n",
    "doc_collection = Collection(COLLECTION_NAME, schema)\n",
    "\n",
    "# Create an index for the collection.\n",
    "index = {\n",
    "    \"index_type\": \"IVF_FLAT\",\n",
    "    \"metric_type\": \"L2\",\n",
    "    \"params\": {\"nlist\": 128},\n",
    "}\n",
    "doc_collection.create_index(\"embeddings\", index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Here we have prepared a data source, which is crawled from the latest [blog](https://openai.com/blog/new-embedding-models-and-api-updates#fn-A) of Open AI, and its name is `openai_blog.txt`. It stores each sentence as a line, and we convert each line in the document into a vector with `text-embedding-3-small` and then insert these embeddings into Milvus collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "with open('./docs/openai_blog.txt', 'r') as f:\n",
    "    lines = f.readlines()\n",
    "\n",
    "embeddings = []\n",
    "for line in lines:\n",
    "    response = client.embeddings.create(\n",
    "        input=line,\n",
    "        model=EMBEDDING_MODEL\n",
    "    )\n",
    "    embeddings.append(response.data[0].embedding)\n",
    "\n",
    "entities = [\n",
    "    list(range(len(lines))),  # field id (primary key) \n",
    "    lines,  # field text\n",
    "    embeddings,  #field embeddings\n",
    "]\n",
    "insert_result = doc_collection.insert(entities)\n",
    "\n",
    "# After final entity is inserted, it is best to call flush to have no growing segments left in memory\n",
    "doc_collection.flush()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Query\n",
    "\n",
    "Here we will build a `semantic_search` function, which is used to retrieve the topK most semantically similar document from a Milvus collection.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Load the collection into memory for searching\n",
    "doc_collection.load()\n",
    "\n",
    "\n",
    "def semantic_search(query, top_k=3):\n",
    "    response = client.embeddings.create(\n",
    "        input=query,\n",
    "        model=EMBEDDING_MODEL\n",
    "    )\n",
    "    vectors_to_search = [response.data[0].embedding]\n",
    "    search_params = {\n",
    "        \"metric_type\": \"L2\",\n",
    "        \"params\": {\"nprobe\": 10},\n",
    "    }\n",
    "    result = doc_collection.search(vectors_to_search, \"embeddings\", search_params, limit=top_k, output_fields=[\"text\"])\n",
    "    return result[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Here we ask questions about the price of the latest embedding models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "distance = 0.50\n",
      "Pricing for `text-embedding-3-small` has therefore been reduced by 5X compared to `text-embedding-ada-002`, from a price per 1k tokens of $0.0001 to $0.00002.\n",
      "\n",
      "distance = 0.56\n",
      "`text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding- ada-002` model released in December 2022.\n",
      "\n",
      "distance = 0.56\n",
      "**`text-embedding-3-large` is our new best performing model.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "question = 'What is the price of the `text-embedding-3-small` model?'\n",
    "\n",
    "match_results = semantic_search(question, top_k=3)\n",
    "for match in match_results:\n",
    "    print(f\"distance = {match.distance:.2f}\\n{match.text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The smaller the distance, the closer the vector is, that is, semantically more similar. We can see that the top 1 results returned can answer this question.\n",
    "\n",
    "Let's try another question, it's a question about the new GPT-4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "distance = 0.97\n",
      "Over 70% of requests from GPT-4 API customers have transitioned to GPT-4 Turbo since its release, as developers take advantage of its updated knowledge cutoff, larger 128k context windows, and lower prices.\n",
      "\n",
      "distance = 1.02\n",
      "Today, we are releasing an updated GPT-4 Turbo preview model, `gpt-4-0125-preview`.\n",
      "\n",
      "distance = 1.02\n",
      "* Overview * Index * GPT-4 * DALL·E 3\n",
      "\n"
     ]
    }
   ],
   "source": [
    "question = 'What is the context window size of GPT-4??'\n",
    "\n",
    "match_results = semantic_search(question, top_k=3)\n",
    "for match in match_results:\n",
    "    print(f\"distance = {match.distance:.2f}\\n{match.text}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Our semantic retrieval is able to identify the meaning of our queries and return the most semantically similar documents from Milvus collection.\n",
    "\n",
    "We can delete this collection to save resources."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Drops the collection\n",
    "utility.drop_collection(COLLECTION_NAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is how to use OpenAI embedding model and Milvus to perform semantic search. Milvus has also integrated with other model providers such as Cohere and HuggingFace, you can learn more at https://milvus.io/docs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
